Systems and methods for dynamically managing virtual machines

ABSTRACT

Techniques for dynamic management of virtual machine environments are disclosed. For example, a technique for automatically managing a first set of virtual machines being hosted by a second set of physical machines comprises the following steps/operations. An alert is obtained that a service level agreement (SLA) pertaining to at least one application being hosted by at least one of the virtual machines in the first set of virtual machines is being violated. Upon obtaining the SLA violation alert, the technique obtains at least one performance measurement for at least a portion of the machines in at least one of the first set of virtual machines and the second set of physical machines, and a cost of migration for at least a portion of the virtual machines in the first set of virtual machines. Based on the obtained performance measurements and the obtained migration costs, an optimal migration policy is determined for moving the virtual machine hosting the at least one application to another physical machine.

FIELD OF THE INVENTION

The present invention generally relates to virtual machine environments and, more particularly, to techniques for dynamically managing virtual machines.

BACKGROUND OF THE INVENTION

An important problem encountered in today's information technology (IT) environment is known as server sprawl. Because of unplanned growth, many data centers today have large numbers of heterogeneous servers, each hosting one application and often grossly under utilized.

A solution to this problem is a technique known as server consolidation. In general, server consolidation involves converting each physical server or physical machine into a virtual server or virtual machine (VM), and then mapping multiple VMs to a physical machine, thus increasing utilization and reducing the required number of physical machines.

There are some critical runtime issues associated with a consolidated server environment. For example, due to user application workload changes or fluctuations, a critical problem often arises in these environments. The critical problem is that end user application performance degrades due to over utilization of critical resources in some of the physical machines. Accordingly, an existing allocation of VMs to physical machines may no longer satisfy service level agreement (SLA) requirements. As is known, an SLA is an agreement between a service customer (e.g., application owner) and a service provider (e.g., application host) that specifies the parameters of a particular service (e.g., minimum quality of service level). As a result, VMs may need to be reallocated to other physical machines. However, such a reallocation has an associated migration cost. Existing consolidation approaches do not account for the cost of migration.

SUMMARY OF THE INVENTION

Principles of the present invention provide techniques for dynamic management of virtual machine environments.

For example, in one aspect of the invention, a technique for automatically managing a first set of virtual machines being hosted by a second set of physical machines comprises the following steps/operations. An alert is obtained that a service level agreement (SLA) pertaining to at least one application being hosted by at least one of the virtual machines in the first set of virtual machines is being violated. Upon obtaining the SLA violation alert, the technique obtains at least one performance measurement for at least a portion of the machines in at least one of the first set of virtual machines and the second set of physical machines, and a cost of migration for at least a portion of the virtual machines in the first set of virtual machines. Based on the obtained performance measurements and the obtained migration costs, an optimal migration policy is determined for moving the virtual machine hosting the at least one application to another physical machine.

These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an example of server consolidation;

FIG. 2A illustrates a virtual server management methodology, according to an embodiment of the present invention;

FIG. 2B illustrates a virtual machine reallocation methodology, according to an embodiment of the present invention;

FIG. 3 illustrates an example mapping of virtual machines to physical machines; and

FIG. 4 illustrates a computing system in accordance with which one or more components/steps of a virtual server management system may be implemented, according to an embodiment of the present invention.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

The following description will illustrate the invention using an exemplary SLA-based service provider environment. It should be understood, however, that the invention is not limited to use with such a particular environment. The invention is instead more generally applicable to any data processing or computing environment in which it would be desirable to manage virtual servers used to perform such data processing or computing operations.

It is to be appreciated that, as used herein, a “physical machine” or “physical server” refers to an actual computing device, while a “virtual machine” or “virtual server” refers to a logical object that acts as a physical machine. In one embodiment, the computing device may be a Blade™ available from International Business Machines Corporation (Armonk, N.Y.). A Blade™ includes a “thin” software layer called a Hypervisor™, which creates the virtual machine. A physical machine equipped with a Hypervisor™ can create multiple virtual machines. Each virtual machine can execute a separate copy of the operating system, as well as one or more applications.

As will be illustrated below, a methodology of the invention provides a polynomial time approximate solution for dynamic migration of virtual machines (VMs) to maintain SLA compliance. Such a management methodology minimizes the associated cost of migration and allows dynamic addition or removal of physical machines as needed (in order to reduce total cost of ownership). The approach of the methodology is an iterative approach, which improves upon the existing solution of allocating VMs to physical machines. Moreover, the approach is independent of application software, and works with virtual machines at the operating system level. Such a methodology can be used as a part of a larger management system, e.g., the International Business Machines Corporation (Armonk, N.Y.) Director system, by using its monitoring mechanism and producing event action plans for automatic migration of VMs when needed.

Before describing an illustration of the inventive approach, an explanation of the basic steps and features of the server consolidation process is given in the context of FIG. 1.

As shown in FIG. 1, physical servers 100-1 through 100-n each host a separate application (App1 through Appn, respectively). However, as noted below each box representing the server, each application only utilizes between 25% and 50% of the processing capacity of the server. Thus, each server is considered under utilized.

Each physical server (100-1 through 100-n) is converted (step 105) into a virtual machine (VM1 through VMn denoted as 110-1 through 110-n, respectively) using available virtualization technology, e.g., available from VMWare or XenSource (both of Palo Alto, Calif.). Server virtualization is a technique that is well known in the art and is, therefore, not described in detail herein.

Multiple VMs are then mapped into a physical machine, using central processing unit (CPU) utilization, memory usage, etc. as metrics for resource requirements, thus increasing the utilization and reducing the total number of physical machines required to support the original set of applications. That is, as shown, the VMs are mapped into a lesser number of physical machines (120-1 through 120-i, where i is less than n). For example, App1 and App2 are now each hosted by server 120-1, which can be of the same processing capacity as server 100-1, but now is more efficiently utilized.

Typically, at the end of this consolidation process, the data center will consist of a smaller number of homogeneous servers, each loaded with multiple virtual machines (VMs), where each VM represents one of the original servers. The benefit of this process is that heterogeneity and server sprawl are reduced, resulting in less complex management processes and lower cost of ownership.

However, as mentioned above, there are several runtime issues associated with a consolidated server environment. Due to application workload changes or fluctuations, end user application performance degrades due to over utilization of critical resources in some of the physical machines. Thus, an existing allocation of VMs to physical machines may no longer satisfy SLA requirements. VMs may need to be reallocated to other physical machines, and such a reallocation has an associated migration cost. Existing consolidation approaches do not account for the cost of migration.

Illustrative principles of the invention provide a solution to this problem using an automated virtual machine migration methodology that is deployable to dynamically balance the load on the physical servers with an overall objective of maintaining application SLAs. We assume that there is a cost associated with each migration. The inventive solution finds the best set of virtual machine migrations that restores the violated SLAs, minimizes the number of required physical servers, and minimizes the migration cost associated with the reallocation.

The methodology assumes that SLAs are directly related to metrics of the host, such as CPU utilization or memory usage. The inventive methodology is embodied in a virtual server management system that monitors those metrics and, if any of them exceeds a predetermined threshold for a physical server or VM, moves one or more VMs from that physical machine to another physical machine in order to restore acceptable levels of utilization. The VM chosen to be moved is the one with the smallest migration cost, as will be explained below. The chosen VM is moved to the physical machine which has the least residual capacity for the resource associated with that metric and is able to accommodate the VM. An overall objective is to maximize the variance of the utilization across all the existing physical servers, as will be explained below. The procedure is repeated until the SLA violation is corrected. The overall management methodology 200 is depicted in FIG. 2A, while the reallocation or migration methodology is illustrated in FIG. 2B.

As shown in FIG. 2A, a heterogeneous under-utilized server environment (block 210) is the input to server consolidation step 220. The server consolidation step 220 is the server virtualization process described above in the context of FIG. 1. Thus, the input to step 220 is data indicative of the heterogeneous under-utilized server environment, such as the environment including servers 100-1 through 100-n in FIG. 1. This may include information regarding the application running on the physical server as well as server utilization information. Again, since the virtualization process is well known, as well as the data input thereto, a further description of this process is not given herein. The result of the server consolidation step is a consolidated homogeneous environment (block 230). That is, server consolidation step 220 outputs a mapping of multiple VMs to physical servers, which serves to reduce heterogeneity and server sprawl.

As further shown in FIG. 2A, utilization values are monitored. This is accomplished by monitoring agents 240. That is, performance metrics or measurements such as CPU utilization, memory utilization, and input/output utilization of each server in the consolidated environment are measured. The agents may simply be one or more software modules that compile these utilization values reported by the servers. It is to be appreciated that while a utilization value may be from a physical machine or a virtual machine, such values are preferably taken for both the physical machine and the virtual machine. For example, for three virtual machines executing on one physical machine, the system gathers CPU utilization values of the physical machine, three CPU utilization values denoting the virtual machines' CPU usage, and CPU utilization due to the overhead of the Hypervisor™.

These utilization values are then compared to threshold values in step 250 to determine whether they are greater than, less than, or equal to, some predetermined threshold value for the respective type of utilization that is being monitored (e.g., CPU utilization threshold value, memory utilization threshold value, input/output utilization threshold value). Such thresholds are preferably thresholds generated based on the SLA that governs the agreed-upon requirements for hosting the application running on the subject server. For example, the SLA may require that the response time for an end user query to an application running on the subject server be less than a certain number of seconds for a percentage of the requests. Based on knowledge of the processing capacity of the server, this requirement is easily translated into a threshold percentage for CPU capacity. Thus, the subject server hosting the application should never reach or exceed the threshold percent of its CPU capacity.

Accordingly, if the subject server is being under utilized (e.g., below the threshold value) or over utilized (e.g., greater than or equal to the threshold value), then the threshold computation step 250 will detect this condition and generate an appropriate alert, if necessary. In this embodiment, when the server is being over utilized, an SLA violation alert is generated.
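By way of illustration only, the threshold comparison described above might be sketched as follows. The function name `find_violations`, the metric keys, and the numeric thresholds are assumptions made for this example and do not appear in the embodiment; the sketch treats a metric as over utilized when its value is greater than or equal to its threshold, as stated above.

```python
from typing import Dict, List


def find_violations(utilization: Dict[str, float],
                    thresholds: Dict[str, float]) -> List[str]:
    """Return the metrics whose utilization reaches or exceeds its threshold.

    Both dictionaries map a metric name (hypothetical keys such as "cpu")
    to a fraction between 0 and 1. Metrics without a threshold are ignored.
    """
    return [
        metric
        for metric, value in utilization.items()
        if metric in thresholds and value >= thresholds[metric]
    ]


# Hypothetical readings for one physical machine.
readings = {"cpu": 0.85, "memory": 0.50, "io": 0.30}
limits = {"cpu": 0.80, "memory": 0.90, "io": 0.90}

violations = find_violations(readings, limits)
if violations:
    # In the described system this is where an SLA violation alert would be raised.
    print("SLA violation alert for metrics:", violations)
```

In this sketch, only the CPU reading reaches its threshold, so only that metric would give rise to an alert.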

If such an SLA violation alert is generated, a VM reallocation methodology of the invention is then triggered in step 260. The input to the VM reallocation methodology includes: (i) utilization values (e.g., CPU, memory, I/O) as computed by the monitoring agents 240; (ii) SLA information related to the thresholds; (iii) metric thresholds as computed in the threshold computation step; and (iv) a weight coefficient vector specifying the importance of each utilization dimension to the overall cost function.

The cost of reallocation (also referred to as migration) of a VM is defined as a dot product of a vector representing utilization and the vector representing the weight coefficient. For example, the dimension of both these vectors would be 2, if we consider only two resource metrics, CPU utilization and memory usage. It is to be appreciated that these migration costs are computed and maintained by the reallocation component 260 of the virtual server management system 200 or, alternatively, by another component of the system.

By way of example, assume that, for a particular VM, the metrics are [0.2, 0.5], where 0.2 denotes 20% CPU utilization and 0.5 denotes 50% memory usage, and the cost vector is [5, 10], signifying that, in the total cost of migration, the CPU usage has a weight of 5 and the memory usage has a weight of 10. Then, the cost of migration for this example VM is: 0.2*5+0.5*10=6 units.
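The dot product in the preceding example can be reproduced with a few lines of Python. This is a minimal sketch; the function name `migration_cost` is chosen for illustration and is not part of the described system.

```python
from typing import Sequence


def migration_cost(utilization: Sequence[float],
                   weights: Sequence[float]) -> float:
    """Dot product of a VM's utilization vector and the weight coefficient vector."""
    if len(utilization) != len(weights):
        raise ValueError("utilization and weight vectors must have the same dimension")
    return sum(u * w for u, w in zip(utilization, weights))


# Values from the example: 20% CPU, 50% memory, weights 5 and 10.
cost = migration_cost([0.2, 0.5], [5, 10])
print(cost)  # 0.2*5 + 0.5*10, i.e. 6 units
```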

The reallocation methodology of 260 includes two steps. Assume PM₁, PM₂, . . . , PM_(m) are the physical machines and V_(ij) is the j-th virtual machine on PM_(i). For each physical machine PM_(i), the methodology maintains a list of virtual machines allocated to PM_(i) ordered by non-decreasing migration cost, i.e., the first VM, V_(i1), has the lowest cost. For each physical machine PM_(i), the methodology calculates and stores a vector representing the residual capacity of PM_(i). The methodology maintains the list of residual capacities in non-decreasing order of the L2 norms of the capacity vectors. An example configuration of VMs and their parent physical machines is shown in FIG. 3.
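The two orderings described above might be maintained as in the following sketch. It assumes, for illustration only, that the residual capacity of a physical machine is its capacity vector minus the sum of the utilization vectors of its VMs; the data layout, the helper names, and all numbers are hypothetical.

```python
import math
from typing import Dict, List

# Hypothetical data: [CPU, memory] fractions; not taken from FIG. 3.
WEIGHTS = [5, 10]
CAPACITY = {"PM1": [1.0, 1.0], "PM2": [1.0, 1.0]}
PLACEMENT = {
    "PM1": {"V11": [0.4, 0.3], "V12": [0.2, 0.5]},
    "PM2": {"V21": [0.1, 0.2]},
}


def migration_cost(util: List[float], weights: List[float]) -> float:
    """Dot product of utilization and weight vectors."""
    return sum(u * w for u, w in zip(util, weights))


def residual_capacity(capacity: List[float],
                      vm_utils: List[List[float]]) -> List[float]:
    """Capacity minus the per-dimension sum of VM utilizations (an assumption)."""
    used = [sum(dim) for dim in zip(*vm_utils)] if vm_utils else [0.0] * len(capacity)
    return [c - u for c, u in zip(capacity, used)]


def l2_norm(vec: List[float]) -> float:
    return math.sqrt(sum(x * x for x in vec))


def vms_by_cost(vms: Dict[str, List[float]], weights: List[float]) -> List[str]:
    """VM names of one physical machine in non-decreasing order of migration cost."""
    return sorted(vms, key=lambda name: migration_cost(vms[name], weights))


def pms_by_residual(capacity: Dict[str, List[float]],
                    placement: Dict[str, Dict[str, List[float]]]) -> List[str]:
    """Physical machine names in non-decreasing order of residual L2 norm."""
    def norm(pm: str) -> float:
        return l2_norm(residual_capacity(capacity[pm], list(placement[pm].values())))
    return sorted(capacity, key=norm)


print(vms_by_cost(PLACEMENT["PM1"], WEIGHTS))  # cheapest VM first
print(pms_by_residual(CAPACITY, PLACEMENT))    # most tightly packed machine first
```

With these sample values, V11 (cost 5) precedes V12 (cost 6), and PM1, which has less spare capacity, precedes PM2.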

As mentioned above, the reallocation algorithm is triggered by one of the monitored utilization values exceeding one of the utilization thresholds. Assume that a physical machine PM_(i) exhibits a condition, whereby one of the measured metrics (e.g., CPU utilization) exceeds the set threshold. According to the reallocation methodology (illustrated in FIG. 2B), one of the associated VMs of the threshold-exceeding physical machine is chosen to migrate to another physical machine in the following manner:

(i) select the VM (e.g., VM_(ij)) which has associated therewith the least migration cost (step 261);

(ii) select the physical machine (PM_(j)) which has the least residue resource vector, but enough to accommodate VM_(ij) (step 262);

(iii) instruct the virtual machine migration system (block 270 in FIG. 2A) to move VM_(ij) to PM_(j) (step 263);

(iv) if no physical machine is available to accommodate the virtual machine, a new physical machine is introduced into the server farm and VM_(ij) is mapped thereto (step 264); and

(v) recalculate the residue vectors and sort the VMs according to the costs (step 265).
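Steps (i) through (v) above can be sketched, under stated assumptions, as a single reallocation pass. The classes `VM` and `PM`, the function `reallocate_once`, the use of per-machine capacity vectors derived from thresholds, and the naming of a newly introduced machine are all hypothetical choices for this illustration; the sketch returns a migration instruction rather than invoking any real migration system.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class VM:
    name: str
    utilization: List[float]  # e.g. [CPU, memory] as fractions


@dataclass
class PM:
    name: str
    capacity: List[float]     # assumed usable capacity per resource
    vms: List[VM] = field(default_factory=list)

    def residual(self) -> List[float]:
        """Capacity minus the summed utilization of hosted VMs (an assumption)."""
        used = [0.0] * len(self.capacity)
        for vm in self.vms:
            used = [u + x for u, x in zip(used, vm.utilization)]
        return [c - u for c, u in zip(self.capacity, used)]


def cost(vm: VM, weights: List[float]) -> float:
    return sum(u * w for u, w in zip(vm.utilization, weights))


def fits(pm: PM, vm: VM) -> bool:
    return all(r >= x for r, x in zip(pm.residual(), vm.utilization))


def reallocate_once(pms: List[PM], overloaded: PM, weights: List[float],
                    new_pm_capacity: List[float]) -> Tuple[str, str, str]:
    """Perform one move and return (vm, source, target) as the instruction."""
    # Step 261: pick the VM with the least migration cost.
    vm = min(overloaded.vms, key=lambda v: cost(v, weights))
    # Step 262: among machines that can host it, pick the least residual (L2 norm).
    candidates = [p for p in pms if p is not overloaded and fits(p, vm)]
    if candidates:
        target = min(candidates,
                     key=lambda p: math.sqrt(sum(r * r for r in p.residual())))
    else:
        # Step 264: no machine can accommodate the VM, so introduce a new one.
        target = PM(name=f"PM{len(pms) + 1}", capacity=list(new_pm_capacity))
        pms.append(target)
    # Step 263: apply the move (a real system would instruct the migration system).
    overloaded.vms.remove(vm)
    target.vms.append(vm)
    # Step 265: residuals are recomputed on demand; re-sort VM lists by cost.
    for p in pms:
        p.vms.sort(key=lambda v: cost(v, weights))
    return vm.name, overloaded.name, target.name


# Hypothetical farm: PM1 exceeds its CPU capacity of 0.8 (0.5 + 0.2 + 0.2 = 0.9).
pms = [
    PM("PM1", [0.8, 0.8], [VM("a", [0.5, 0.2]), VM("b", [0.2, 0.5]), VM("c", [0.2, 0.1])]),
    PM("PM2", [0.8, 0.8], [VM("d", [0.5, 0.5])]),
    PM("PM3", [0.8, 0.8], [VM("e", [0.1, 0.1])]),
]
moved = reallocate_once(pms, pms[0], [5, 10], [0.8, 0.8])
print(moved)  # ('c', 'PM1', 'PM2')
```

In this example, VM c has the lowest cost (2 units) and is moved to PM2, which has less residual capacity than PM3 yet can still accommodate it. In keeping with the iterative procedure described herein, a caller would repeat the pass while the violation persists.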

It is to be understood that the reallocation methodology (step 260) generates one or more migration instructions, e.g., move VM₁ from PM₁ to PM₂, remove server PM₃, etc. The virtual machine migration system (block 270) then takes the instructions and causes them to be implemented in the consolidated homogeneous server environment (block 230). It is to be understood that existing products can be used for the virtual machine migration system, for example, the Virtual Center from VMWare (Palo Alto, Calif.).

The above heuristic is based on a goal of maximizing the variance of the residue vector, so that the physical machines are as closely packed as SLA requirements will allow, thus leading to a high overall utilization, minimizing the cost of migration and minimizing the need for introducing new physical machines. Further, it is to be appreciated that the virtual server management procedure is iterative in nature, i.e., the steps of FIG. 2A (and thus FIG. 2B) are repeated until all SLA violations are remedied. Still further, based on the iterative nature of the methodology, minimal migration moves are made for each triggering event. Also, the methodology serves to maximize physical server load variance.
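One way to read the variance goal is illustrated by the following sketch, which compares placing a VM on the feasible machine with the least residual capacity against placing it on the machine with the most. It assumes a single resource dimension and made-up residual values; the function `place` is hypothetical and the variance is computed with `statistics.pvariance` from the Python standard library.

```python
from statistics import pvariance
from typing import List


def place(residuals: List[float], demand: float, least_residual: bool) -> List[float]:
    """Return residuals after placing a VM on the tightest or the loosest feasible machine."""
    feasible = [i for i, r in enumerate(residuals) if r >= demand]
    if not feasible:
        raise ValueError("no machine can accommodate the VM")
    chooser = min if least_residual else max
    pick = chooser(feasible, key=lambda i: residuals[i])
    updated = list(residuals)
    updated[pick] -= demand
    return updated


# Hypothetical residual CPU capacities of two machines and a VM needing 0.2.
residuals = [0.3, 0.7]
packed = place(residuals, 0.2, least_residual=True)    # roughly [0.1, 0.7]
spread = place(residuals, 0.2, least_residual=False)   # roughly [0.3, 0.5]

print(pvariance(packed), pvariance(spread))
```

Under these assumptions, choosing the machine with the least residual capacity yields the larger variance (about 0.09 versus about 0.01), which is consistent with the packing behavior described above.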

FIG. 4 illustrates a computing system in accordance with which one or more components/steps of the virtual server management techniques (e.g., components and methodologies described in the context of FIGS. 1 through 3) may be implemented, according to an embodiment of the present invention. It is to be understood that the individual components/steps may be implemented on one such computing system or on more than one such computing system. In the case of an implementation on a distributed computing system, the individual computer systems and/or devices may be connected via a suitable network, e.g., the Internet or World Wide Web. However, the system may be realized via private or local networks. In any case, the invention is not limited to any particular network.

Thus, the computing system shown in FIG. 4 may represent one or more servers or one or more other processing devices capable of providing all or portions of the functions described herein.

As shown, the computing system architecture 400 may comprise a processor 410, a memory 420, I/O devices 430, and a network interface 440, coupled via a computer bus 450 or alternate connection arrangement.

It is to be appreciated that the term “processor” as used herein is intended to include any processing device, such as, for example, one that includes a CPU and/or other processing circuitry. It is also to be understood that the term “processor” may refer to more than one processing device and that various elements associated with a processing device may be shared by other processing devices.

The term “memory” as used herein is intended to include memory associated with a processor or CPU, such as, for example, RAM, ROM, a fixed memory device (e.g., hard drive), a removable memory device (e.g., diskette), flash memory, etc.

In addition, the phrase “input/output devices” or “I/O devices” as used herein is intended to include, for example, one or more input devices (e.g., keyboard, mouse, etc.) for entering data to the processing unit, and/or one or more output devices (e.g., display, etc.) for presenting results associated with the processing unit.

Still further, the phrase “network interface” as used herein is intended to include, for example, one or more transceivers to permit the computer system to communicate with another computer system via an appropriate communications protocol.

Accordingly, software components including instructions or code for performing the methodologies described herein may be stored in one or more of the associated memory devices (e.g., ROM, fixed or removable memory) and, when ready to be utilized, loaded in part or in whole (e.g., into RAM) and executed by a CPU.

In any case, it is to be appreciated that the techniques of the invention, described herein and shown in the appended figures, may be implemented in various forms of hardware, software, or combinations thereof, e.g., one or more operatively programmed general purpose digital computers with associated memory, implementation-specific integrated circuit(s), functional circuitry, etc. Given the techniques of the invention provided herein, one of ordinary skill in the art will be able to contemplate other implementations of the techniques of the invention.

Although illustrative embodiments of the present invention have been described herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be made by one skilled in the art without departing from the scope or spirit of the invention.

1. A method of automatically managing a first set of virtual machines being hosted by a second set of physical machines, comprising the steps of: obtaining an alert that a service level agreement (SLA) pertaining to at least one application being hosted by at least one of the virtual machines in the first set of virtual machines is being violated; upon obtaining the SLA violation alert: obtaining at least one performance measurement for at least a portion of the machines in at least one of the first set of virtual machines and the second set of physical machines; obtaining a cost of migration for at least a portion of the virtual machines in the first set of virtual machines; and determining, based on the obtained performance measurements and the obtained migration costs, an optimal migration policy for moving the virtual machine hosting the at least one application to another physical machine.
2. The method of claim 1, wherein the optimal policy determining step further comprises the step of selecting a virtual machine from the first set of virtual machines with the lowest migration cost.
3. The method of claim 2, wherein the optimal policy determining step further comprises the step of selecting a physical machine from the second set of physical machines that has a resource residue that is the lowest among the physical machines and that can accommodate the selected virtual machine.
4. The method of claim 3, wherein the optimal policy determining step further comprises the step of generating an instruction to move the selected virtual machine to the selected physical machine.
5. The method of claim 4, wherein the optimal policy determining step further comprises the step of recalculating resource residues for the second set of physical machines.
6. The method of claim 5, wherein the optimal policy determining step further comprises the step of sorting the first set of virtual machines according to migration costs.
7. The method of claim 3, wherein when the second set of physical machines does not include a physical machine that can accommodate the selected virtual machine, mapping the selected virtual machine to a physical machine that is not in the second set of physical machines.
8. The method of claim 1, wherein at least a portion of the steps of the management method are iteratively performed until the SLA violation is remedied.
9. Apparatus for automatically managing a first set of virtual machines being hosted by a second set of physical machines, comprising: a memory; and at least one processor coupled to the memory and operative to: (i) obtain an alert that a service level agreement (SLA) pertaining to at least one application being hosted by at least one of the virtual machines in the first set of virtual machines is being violated; and (ii) upon obtaining the SLA violation alert: obtain at least one performance measurement for at least a portion of the machines in at least one of the first set of virtual machines and the second set of physical machines; obtain a cost of migration for at least a portion of the virtual machines in the first set of virtual machines, and determine, based on the obtained performance measurements and the obtained migration costs, an optimal migration policy for moving the virtual machine hosting the at least one application to another physical machine.
10. The apparatus of claim 9, wherein the optimal policy determining operation further comprises selecting a virtual machine from the first set of virtual machines with the lowest migration cost.
11. The apparatus of claim 10, wherein the optimal policy determining operation further comprises selecting a physical machine from the second set of physical machines that has a resource residue that is the lowest among the physical machines and that can accommodate the selected virtual machine.
12. The apparatus of claim 11, wherein the optimal policy determining operation further comprises generating an instruction to move the selected virtual machine to the selected physical machine.
13. The apparatus of claim 12, wherein the optimal policy determining operation further comprises recalculating resource residues for the second set of physical machines.
14. The apparatus of claim 13, wherein the optimal policy determining operation further comprises sorting the first set of virtual machines according to migration costs.
15. The apparatus of claim 11, wherein when the second set of physical machines does not include a physical machine that can accommodate the selected virtual machine, mapping the selected virtual machine to a physical machine that is not in the second set of physical machines.
16. The apparatus of claim 9, wherein at least a portion of the operations of the management apparatus are iteratively performed until the SLA violation is remedied.
17. An article of manufacture for automatically managing a first set of virtual machines being hosted by a second set of physical machines, comprising a machine readable medium containing one or more programs which when executed implement the steps of: obtaining an alert that a service level agreement (SLA) pertaining to at least one application being hosted by at least one of the virtual machines in the first set of virtual machines is being violated; upon obtaining the SLA violation alert: obtaining at least one performance measurement for at least a portion of the machines in at least one of the first set of virtual machines and the second set of physical machines; obtaining a cost of migration for at least a portion of the virtual machines in the first set of virtual machines; and determining, based on the obtained performance measurements and the obtained migration costs, an optimal migration policy for moving the virtual machine hosting the at least one application to another physical machine.